System for Annotating Media Content for Automatic Content Understanding

ABSTRACT

A system for annotating frames in a media stream includes a pattern recognition system (PRS) to generate PRS output metadata for a frame; an archive for storing ground truth metadata (GTM); a device to merge the GTM and PRS output metadata and thereby generate proposed annotation data (PAD); and a user interface for use by a human annotator (HA). The user interface includes an editor and an input device used by the HA to approve GTM for the frame. An optimization system receives the approved GTM and metadata output by the PRS, and adjusts input parameters for the PRS to minimize a distance metric corresponding to a difference between the GTM and PRS output metadata.

CROSS REFERENCE TO RELATED PATENT APPLICATION

This patent application claims a benefit to the priority date of the filing of U.S. Provisional Patent Application Ser. No. 61/637,344, titled “System for Annotating Media Content for Improved Automatic Content Understanding Performance,” by Petajan et al., that was filed on Apr. 24, 2012. The disclosure of U.S. 61/637,344 is incorporated by reference herein in its entirety.

FIELD OF THE DISCLOSURE

This disclosure relates to media presentations (e.g. live sports events), and more particularly to a system for improving performance by generating annotations for the media stream.

BACKGROUND OF THE DISCLOSURE

A media presentation, such as a broadcast of an event, may be understood as a stream of audio/video frames (live media stream). It is desirable to add information to the media stream to enhance the viewer's experience; this is generally referred to as annotating the media stream. The annotation of a media stream is a tedious and time-consuming task for a human. Visual inspection of text, players, balls, and field/court position is mentally taxing and error prone. Keyboard and mouse entry are needed to enter annotation data but are also error prone and mentally taxing. Accordingly, systems have been developed to at least partially automate the annotation process.

Pattern Recognition Systems (PRS), e.g. computer vision or Automatic Speech Recognition (ASR), process media streams in order to generate meaningful metadata. Recognition systems operating on natural media streams always perform with less than absolute accuracy due to the presence of noise. Computer Vision (CV) is notoriously error prone and ASR is only useable under constrained conditions. The measurement of system accuracy requires knowledge of the correct PRS result, referred to here as Ground Truth Metadata (GTM). The development of a PRS requires the generation of GTM that must be validated by Human Annotators (HA). GTM can consist of positions in space or time, labeled features, events, text, region boundaries, or any data with a unique label that allows referencing and comparison.

A compilation of acronyms used herein is appended to this Specification.

There remains a need for a system that can reduce the human time and effort required to create the GTM.

SUMMARY OF THE DISCLOSURE

We refer to a system for labeling features in a given frame of video (or audio) or events at a given point in time as a Media Stream Annotator (MSA). If accurate enough, a given PRS automatically generates metadata from the media streams that can be used to reduce the human time and effort required to create the GTM. According to an aspect of the disclosure, an MSA system and process, with a Human-Computer Interface (HCI), provides more efficient GTM generation and PRS input parameter adjustment.

GTM is used to verify PRS accuracy and adjust PRS input parameters or to guide algorithm development for optimal recognition accuracy. The GTM can be generated at low levels of detail in space and time, or at higher levels as events or states with start times and durations that may be imprecise compared to low-level video frame timing.

Adjustments to PRS input parameters that are designed to be static during a program should be applied to all sections of a program with associated GTM in order to maximize the average recognition accuracy and not just the accuracy of the given section or video frame. If the MSA processes live media, the effect of any automated PRS input parameter adjustments must be measured on all sections with (past and present) GTM before committing the changes for generation of final production output.

A system embodying the disclosure may be applied to both live and archived media programs and has the following features:

-   Random access into a given frame or section of the archived media stream and associated metadata
-   Real-time display or graphic overlay of PRS-generated metadata on or near video frame display
-   Single click approval of conversion of Proposed Annotation Data (PAD) into GTM
-   PRS recomputes all metadata when GTM changes
-   Merge metadata from 3rd parties with human annotations
-   Graphic overlay of compressed and decoded metadata on or near decoded low bit-rate video to enable real-time operation on mobile devices and consumer-grade internet connections

The foregoing has outlined, rather broadly, the preferred features of the present disclosure so that those skilled in the art may better understand the detailed description of the disclosure that follows. Additional features of the disclosure will be described hereinafter that form the subject of the claims of the disclosure. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiment as a basis for designing or modifying other structures for carrying out the same purposes of the present disclosure and that such other structures do not depart from the spirit and scope of the disclosure in its broadest form.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic illustration of the Media Stream Annotator (MSA), according to an embodiment of the disclosure.

FIG. 2 is a schematic illustration of the Media Annotator flow chart during Third Party Metadata (TPM) ingest, according to an embodiment of the disclosure.

FIG. 3 is a schematic illustration of the Media Annotator flow chart during Human Annotation, according to an embodiment of the disclosure.

FIG. 4 is a schematic illustration of a football miniboard, according to an embodiment of the disclosure.

DETAILED DESCRIPTION

The accuracy of any PRS depends on the application of constraints that reduce the number or range of possible results. These constraints can take the form of a priori information, physical and logical constraints, or partial recognition results with high reliability. A priori information for sports includes the type of sport, stadium architecture and location, date and time, teams, players, broadcaster, language, and the media ingest process (e.g., original A/V resolution and transcoding). Physical constraints include camera inertia, camera mount type, lighting, and the physics of players, balls, equipment, courts, fields, and boundaries. Logical constraints include the rules of the game, sports production methods, uniform colors and patterns, and scoreboard operation. Some information can be reliably extracted from the media stream with minimal a priori information and can be used to “boot strap” subsequent recognition processes. For example, the presence of the graphical miniboard overlaid on the game video (shown in FIG. 4) can be detected with only knowledge of the sport and the broadcaster (e.g., ESPN, FOX Sports, etc).

If a live media sporting event is processed in real time, only the current and past media streams are available for pattern recognition and metadata generation. A recorded sporting event can be processed with access to any frame in the entire program. The PRS processing a live event can become more accurate as time progresses since more information is available over time, while any frame from a recorded event can be analyzed repeatedly from the past or the future until maximum accuracy is achieved.

The annotation of a media stream is a tedious and time-consuming task for a human. Visual inspection of text, players, balls, and field/court position is mentally taxing and error prone. Keyboard and mouse entry are needed to enter annotation data but are also error prone and mentally taxing. Human annotation productivity (speed and accuracy) is greatly improved by properly displaying available automatically generated Proposed Annotation Data (PAD) and thereby minimizing the mouse and keyboard input needed to edit and approve the PAD. If the PAD is correct, the Human Annotator (HA) can simultaneously approve the current frame and select the next frame for annotation with only one press of a key or mouse button. The PAD is the current best automatically generated metadata that can be delivered to the user without significant delay. Waiting for the system to maximize the accuracy of the PAD may decrease editing by the HA but will also delay the approval of the given frame.

FIG. 1 shows a Media Stream Annotator (MSA) system according to an embodiment of the disclosure. The MSA ingests both live and archived media streams (LMS 114 and AMS 115), and optional Third Party Metadata (TPM) 101 and input from the HA 118. The PAD is derived from a combination of PRS 108 result metadata and TPM 101. Metadata output by PRS 108 is archived in Metadata Archive 109. If the TPM 101 is available during live events the system can convert the TPM 101 to GTM via the Metadata Mapper 102 and then use the Performance Optimization System (POS) 105 to adjust PRS Input Parameters to improve metadata accuracy for both past (AMS 115) and presently ingested media (LMS 114). The PAD Encoder 110 merges GTM with metadata for each media frame and encodes the PAD into a compressed form suitable for transmission to the Human Annotator User Interface (HAUI) 104 via a suitable network, e.g. Internet 103. This information is subsequently decoded and displayed to the HA, in a form the HA can edit, by a Media Stream and PAD Decoder, Display and Editor (MSPDE) 111. The HAUI also includes a Media Stream Navigator (MSN) 117 which the HA uses to select time points in the media stream whose corresponding frames are to be annotated. A low bit-rate version of the media stream is transcoded from the AMS by a Media Transcoder 116 and then transmitted to the HAUI.

As GTM is generated by the HA 118 and stored in the GTM Archive 106, the POS 105 compares the PRS 108 output metadata to the GTM and detects significant differences between them. During the design and development of the PRS 108, input parameters are set with initial estimated values that produce accurate results on an example set of media streams and associated GTM. These parameter values are adjusted by the POS 105 until the difference between all GTM and the PRS 108 generated metadata is minimized.

During development (as opposed to live production) the POS 105 does not need to operate in real time and exhaustive optimization algorithms may be used. During a live program the POS 105 should operate as fast as possible to improve PRS 108 performance each time new GTM is generated by the HA 118; faster optimization algorithms are therefore used during a live program. The POS 105 is also invoked when new TPM 101 is converted to GTM.

The choice of distance metric between PRS 108 output metadata and GTM depends on the type of data and the allowable variation. For example, in a presentation of a football game the score information extracted from the miniboard must be absolutely accurate while the spatial position of a player on the field can vary. If one PRS input parameter affects multiple types of results, then the distance values for each type can be weighted in a linear combination of distances in order to calculate a single distance for a given frame or time segment of the game.
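The weighted linear combination of per-type distances can be sketched as follows. This is a minimal illustration only: the names `FrameResult`, `score_distance`, `position_distance` and `frame_distance`, the position tolerance, and the weight values are all assumptions made for the example and do not come from the disclosure.

```python
import math
from dataclasses import dataclass


@dataclass
class FrameResult:
    """Hypothetical per-frame record holding two types of result."""
    score_text: str    # score read from the miniboard, e.g. "14-7"
    player_xy: tuple   # (x, y) field position of one player


def score_distance(prs_score, gtm_score):
    # Score must match exactly: any mismatch counts as a full error.
    return 0.0 if prs_score == gtm_score else 1.0


def position_distance(prs_xy, gtm_xy, tolerance=2.0):
    # Position may vary: only the error beyond the tolerance is penalized.
    error = math.hypot(prs_xy[0] - gtm_xy[0], prs_xy[1] - gtm_xy[1])
    return max(0.0, error - tolerance)


def frame_distance(prs, gtm, weights):
    # Single distance for the frame as a weighted sum of per-type distances.
    return (weights["score"] * score_distance(prs.score_text, gtm.score_text)
            + weights["position"] * position_distance(prs.player_xy, gtm.player_xy))


weights = {"score": 100.0, "position": 1.0}  # assumed weights, chosen for illustration
gtm = FrameResult("14-7", (13.0, 14.0))
good_score = frame_distance(FrameResult("14-7", (10.0, 10.0)), gtm, weights)
bad_score = frame_distance(FrameResult("14-1", (10.0, 10.0)), gtm, weights)
```

Giving the score a much larger weight than the position is one way to express that a wrong score should dominate any tolerable position error.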

A variety of TPM 101 (e.g. from stats.com) is available after a delay period from the live action that can be used as GTM either during development or after the delay period during a live program. Since the TPM is delayed by a non-specific period of time, it must be aligned in time with the program. Alignment can either be done manually, or the GTM can be aligned with TPM 101, and/or the PRS 108 result metadata can be aligned using fuzzy matching techniques.

The PRS 108 maintains a set of state variables that change over time as models of the environment, players, overlay graphics, cameras, and weather are updated. The arrival of TPM 101 and, in turn, GTM can drive changes to both current and past state variables. If the history of the state variables is not stored persistently, the POS 105 would have to start the media stream from the beginning in order to use the PRS 108 to regenerate metadata using new PRS 108 Input Parameters. The amount of PRS 108 state variable information can be large, and is compressed using State Codec 112 into one or more sequences of Group Of States (GOS) such that a temporal section of PRS States is encoded and decoded as a group for greater compression efficiency and retrieval speed. The GOS is stored in a GOS Archive 113. The number of media frames in a GOS can be as few as one.
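One possible way to encode and retrieve states as groups is sketched below. The disclosure does not specify the State Codec format; the use of JSON plus zlib, the class name `GosArchive`, and the fixed group size are assumptions made purely for illustration.

```python
import json
import zlib


def encode_gos(states):
    """Compress a temporal run of PRS state snapshots as a single group."""
    payload = json.dumps(states, sort_keys=True).encode("utf-8")
    return zlib.compress(payload)


def decode_gos(blob):
    """Recover the list of state snapshots from one compressed group."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


class GosArchive:
    """Hypothetical archive keyed by group index, one state per media frame."""

    def __init__(self, group_size=30):
        self.group_size = group_size
        self._blobs = {}   # group index -> compressed group
        self._open = []    # states of the group still being filled
        self._count = 0    # total states appended so far

    def append(self, state):
        self._open.append(state)
        self._count += 1
        if len(self._open) == self.group_size:
            group = (self._count - 1) // self.group_size
            self._blobs[group] = encode_gos(self._open)
            self._open = []

    def state_at(self, frame_index):
        # Only the one group containing the frame is decoded.
        group, offset = divmod(frame_index, self.group_size)
        if group in self._blobs:
            return decode_gos(self._blobs[group])[offset]
        return self._open[offset]  # frame is in the group still being filled


archive = GosArchive(group_size=4)
for i in range(10):
    archive.append({"frame": i, "clock": 900 - i})
```

With this layout the PRS can be restarted from the state at any archived frame instead of replaying the media stream from the beginning; a group size of one stores every state independently.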

If the PRS 108 result metadata is stored persistently, the HA can navigate to a past point in time and immediately retrieve the associated metadata or GTM via the PAD Encoder 110, which formats and compresses the PAD for delivery to the HA 118 over the network.
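The merge-and-compress role of the PAD Encoder might look like the following sketch. The rule that existing GTM takes precedence over PRS metadata for the same field, the field names, and the JSON plus zlib encoding are assumptions for illustration; the disclosure states only that GTM and metadata are merged and compressed.

```python
import json
import zlib


def build_pad(prs_metadata, gtm):
    """Merge per-frame PRS metadata with any existing GTM into PAD.

    Assumption: a field already present in GTM overrides the PRS value,
    and each field records its source so an editor could highlight it.
    """
    pad = {}
    for key in sorted(set(prs_metadata) | set(gtm)):
        if key in gtm:
            pad[key] = {"value": gtm[key], "source": "GTM"}
        else:
            pad[key] = {"value": prs_metadata[key], "source": "PRS"}
    return pad


def encode_pad(pad):
    # Lossless compression of the textual PAD for transmission.
    return zlib.compress(json.dumps(pad, sort_keys=True).encode("utf-8"))


def decode_pad(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))


prs_metadata = {"score": "14-7", "clock": "12:41", "down": 2}
gtm = {"clock": "12:40"}
pad = build_pad(prs_metadata, gtm)
wire = encode_pad(pad)
```

On the receiving side, `decode_pad(wire)` would return the same dictionary for display and editing.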

FIG. 2 shows a flow chart for MSA operation, according to an embodiment of the disclosure in which both a live media stream (LMS) and TPM are ingested. All LMS is archived in the AMS (step 201). At system startup, the initial or default values of the GOS are input to the PRS which then starts processing the LMS in real time (step 202). If the PRS does not have sufficient resources to process every LMS frame, the PRS will skip frames to minimize the latency between a given LMS frame and its associated result Metadata (step 203). Periodically, the internal state variable values of the PRS are encoded into GOS and archived (step 204). Finally, the PRS generates metadata which is archived (step 205); the process returns to step 201 and the next or most recent media frame is ingested. The processing loop 201-205 may iterate indefinitely.
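The frame-skipping behavior of step 203 can be illustrated with a small simulation. The function name `frames_processed` and the fixed per-frame processing time are assumptions; a real PRS would have variable processing cost.

```python
def frames_processed(num_frames, frame_interval, processing_time):
    """Simulate which live frames a PRS processes when it always takes
    the most recent frame that has arrived and skips any backlog."""
    processed = []
    now = 0.0
    while True:
        latest = int(now // frame_interval)  # newest frame arrived by `now`
        if latest >= num_frames:
            break
        if processed and latest == processed[-1]:
            # Already handled the newest frame: idle until the next arrives.
            now = (latest + 1) * frame_interval
            continue
        processed.append(latest)
        now += processing_time
    return processed


slow_prs = frames_processed(num_frames=10, frame_interval=1.0, processing_time=2.5)
fast_prs = frames_processed(num_frames=4, frame_interval=1.0, processing_time=0.5)
```

When processing is slower than the frame rate, intermediate frames are dropped so that the metadata always relates to a recent frame; when it is faster, every frame is processed.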

When TPM arrives via the Internet, it is merged with any GTM that exists for that media frame via the Metadata Mapper (step 206). The POS is then notified of the new GTM and generates new sets of PRS Input Parameters, while comparing all resulting Metadata to any corresponding GTM for each set until an optimal set of PRS Input Parameters is found that minimizes the global distance between all GTM and the corresponding Metadata (step 207).
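The search performed by the POS can be sketched as an exhaustive grid search, which is only one of many possible optimization strategies and is not stated to be the one used in the disclosure. The toy PRS (`run_prs`), its single `threshold` parameter, and the sample data are hypothetical.

```python
from itertools import product


def global_distance(run_prs, params, gtm_by_frame, distance):
    """Sum the distance between PRS output and GTM over every frame with GTM."""
    return sum(distance(run_prs(params, frame), gtm)
               for frame, gtm in gtm_by_frame.items())


def optimize(run_prs, grid, gtm_by_frame, distance):
    """Try every combination of candidate parameter values and keep the best."""
    names = sorted(grid)
    best_params, best_cost = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        cost = global_distance(run_prs, params, gtm_by_frame, distance)
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost


# Toy PRS: decides whether a miniboard is present from one measurement per frame.
measurements = {0: 0.2, 1: 0.55, 2: 0.8, 3: 0.45}
gtm_by_frame = {0: False, 1: True, 2: True, 3: False}


def run_prs(params, frame):
    return measurements[frame] > params["threshold"]


def mismatch(prs_value, gtm_value):
    return 0.0 if prs_value == gtm_value else 1.0


best_params, best_cost = optimize(
    run_prs, {"threshold": [0.3, 0.5, 0.7]}, gtm_by_frame, mismatch)
```

Because the cost is summed over all frames with GTM, a parameter set is preferred only if it improves accuracy globally rather than on a single frame.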

FIG. 3 shows a flow chart for MSA operation while the HA approves new GTM. This process operates in parallel with the process shown in the flowchart of FIG. 2. The HA must first select a point on the media stream timeline for annotation (step 301). The HA can find a point in time by dragging a graphical cursor on a media player while viewing a low bit-rate version of the media stream transcoded from the AMS (step 302). The Metadata and any existing GTM associated with the selected time point are retrieved from their respective archives 109, 106 and encoded into the PAD (step 303); transmitted with the Media Stream to the HAUI over the Internet (step 304); and presented to the HA via the HAUI after decoding both PAD and low bit-rate Media Stream (step 305). The HAUI displays the PAD on or near the displayed Media Frame (step 306). The HA compares the PAD with the Media Frame and either clicks on an Approve button 107 or corrects the PAD using an editor and approves the PAD (step 307). After approval of the PAD, the HAUI transmits the corrected and/or approved PAD as new GTM for storage in the GTM Archive (step 308). The POS is then notified of the new GTM and generates new sets of PRS Input Parameters, while comparing all resulting Metadata to any corresponding GTM for each set (step 309) until an optimal set of PRS Input Parameters is found that minimizes the global distance between all GTM and the corresponding Metadata (step 310).

If the MSA is operating only on the AMS (and not on the LMS), the POS can perform more exhaustive and time consuming algorithms to minimize the distance between GTM and Metadata; the consequence of incomplete or less accurate Metadata is more editing time for the HA. If the MSA is operating on LMS during live production, the POS is constrained to not update the PRS Input Parameters for live production until the Metadata accuracy is maximized.

The HA does not need any special skills other than a basic knowledge of the media stream content (e.g. rules of the sporting event) and facility with a basic computer interface. PRS performance depends on the collection of large amounts of GTM to ensure that optimization by the POS will result in optimal PRS performance on new media streams. Accordingly, it is usually advantageous to employ multiple HAs for a given media stream. The pool of HAs is increased if the HAUI client can communicate with the rest of the system over the consumer-grade internet or mobile internet connections which have limited capacity. The main consumer of internet capacity is the media stream that is delivered to the HAUI for decoding and display. Fortunately, the bit-rate of the media stream can be greatly lowered to allow carriage over consumer or mobile internet connections by transcoding the video to a lower resolution and quality. Much of the bit-rate needed for high quality compression of sporting events is applied to complex regions in the video, such as views containing the numerous spectators at the event; however, the HA does not need high quality video of the spectators for annotation. Instead, the HA needs a minimal visual quality for the miniboard, player identification, ball tracking, and field markings which is easily achieved with a minimal compressed bit-rate.

The PAD is also transmitted to the HAUI, but this information is easily compressed as text, graphical coordinates, geometric objects, color properties or animation data. All PAD can be losslessly compressed using statistical compression techniques (e.g. zip), but animation data can be highly compressed using lossy animation stream codecs such as can be found in the MPEG-4 SNHC standard tools (e.g. Face and Body Animation and 3D Mesh Coding).

The display of the transmitted and decoded PAD to the HA is arranged for clearest viewing and comparison between the video and the PAD. For example, as shown in FIG. 4, the miniboard content from the PAD should be displayed below the video frame in its own window pane 402 and vertically aligned with the miniboard in the video 401. PAD content relating to natural (non-graphical) objects in the video should be graphically overlaid on the video.

Editing of the PAD by the HA can be done either in the miniboard text window directly for miniboard data or by dragging spatial location data directly on the video into the correct position (e.g. field lines or player IDs). The combined use of low bit-rate, adequate quality video and compressed text, graphics and animation data which is composited on the video results in a HAUI that can be used with low bit-rate internet connections.

Referring back to FIG. 1, the Metadata Archive 109 and the GTM Archive 106 are ideally designed and implemented to provide fast in-memory access to metadata while writing archive contents to disk as often as needed to allow fast recovery after system failure (power outage, etc). In addition to the inherent speed of memory access (vs disk access), the metadata archives should ideally be architected to provide fast search and data derivation operations. Fast search is needed to find corresponding entries in the GTM 106 vs Metadata 109 archives, and to support the asynchronous writes to the GTM Archive 106 from the Metadata Mapper 102. Preferred designs of the data structures in the archives that support fast search include the use of linked lists and hash tables. Linked lists enable insert edit operations without the need to move blocks of data to accommodate new data. Hash tables provide fast address lookup of sparse datasets.
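An in-memory archive with periodic persistence could be sketched as below. The class name `MetadataArchive`, the JSON snapshot format, the integer frame keys, and the flush-every-N-writes policy are assumptions for illustration; Python's built-in `dict` stands in for the hash table.

```python
import json
import os
import tempfile


class MetadataArchive:
    """Hypothetical archive: hash-table lookups in memory, snapshots on disk."""

    def __init__(self, path, flush_every=100):
        self.path = path
        self.flush_every = flush_every
        self._entries = {}  # hash table: frame index -> metadata
        self._dirty = 0     # writes since the last snapshot
        if os.path.exists(path):
            # Recovery after a failure: reload the last snapshot.
            with open(path, "r", encoding="utf-8") as fh:
                self._entries = {int(k): v for k, v in json.load(fh).items()}

    def put(self, frame, metadata):
        self._entries[int(frame)] = metadata
        self._dirty += 1
        if self._dirty >= self.flush_every:
            self.flush()

    def get(self, frame):
        return self._entries.get(int(frame))

    def flush(self):
        # Write to a temporary file, then rename, so that a crash during
        # the write cannot leave a truncated snapshot behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(self._entries, fh)
        os.replace(tmp_path, self.path)
        self._dirty = 0
```

A smaller `flush_every` value shortens the window of data lost on failure at the cost of more disk writes.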

The ingest of TPM 101 requires that the TPM timestamps be aligned with the GTM 106 and Metadata 109 Archive timestamps. This alignment operation may involve multiple passes over all datasets while calculating accumulated distance metrics to guide the alignment. The ingest of multiple overlapping/redundant TPM requires that a policy be established for dealing with conflicting or inconsistent metadata. In case there is conflict between TPMs 101, the Metadata Mapper 102 should ideally compare the PRS 108 generated Metadata 109 to the conflicting TPMs 101 in case other prior knowledge does not resolve the conflict. If the conflict can't be reliably resolved, then a confidence value should ideally be established for the given metadata which is also stored in the GTM 106. Alternatively, conflicting data can be omitted from the GTM 106.
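One possible conflict policy is sketched below. The disclosure leaves the policy open, so everything here is an assumption: the function name `resolve_conflict`, the rule that PRS agreement settles the conflict with full confidence, the majority vote fallback, and the `min_confidence` cutoff below which the value is omitted.

```python
from collections import Counter


def resolve_conflict(tpm_values, prs_value, min_confidence=0.0):
    """Return (value, confidence) for storage in GTM, or None to omit it.

    tpm_values: the values reported for one field by several TPM sources.
    prs_value:  the PRS-generated value for the same field.
    """
    counts = Counter(tpm_values)
    if len(counts) == 1:
        return tpm_values[0], 1.0          # no conflict between TPM sources
    if prs_value in counts:
        return prs_value, 1.0              # PRS metadata breaks the tie
    value, votes = counts.most_common(1)[0]
    confidence = votes / len(tpm_values)   # share of sources that agree
    if confidence < min_confidence:
        return None                        # omit unreliable data from GTM
    return value, confidence


agreed = resolve_conflict(["14-7", "14-7"], prs_value="10-7")
prs_breaks_tie = resolve_conflict(["14-7", "14-10"], prs_value="14-10")
majority = resolve_conflict(["14-7", "14-7", "14-10"], prs_value="21-7")
omitted = resolve_conflict(["14-7", "14-10"], prs_value="21-7", min_confidence=0.6)
```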

The GTM 106 and Metadata 109 Archives should ideally contain processes for efficiently performing common operations on the archives. For example, if the time base of the metadata needs adjustment, an internal archive process could adjust each timestamp in the whole archive without impacting other communication channels, or tying up other processing resources.

An example of TPM is the game clock from a live sporting event. TPM game clocks typically consist of an individual message for each tick/second of the clock containing the clock value. The delay between the live clock value at the sports venue and the delivered clock value message can be seconds or tens of seconds with variation. The PRS is recognizing the clock from the live video feed and the start time of the game is published in advance. The Metadata Mapper 102 should use all of this information to accurately align the TPM clock ticks with the time base of the GTM 106 and Metadata 109 Archives. At the beginning of the game, there might not be enough data to determine this alignment very accurately, but as time moves forward, more metadata is accumulated and past alignments can be updated to greater accuracy.
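The clock alignment can be sketched by estimating the TPM delivery delay from clock values that both the PRS and the TPM have reported. Using the median of the observed delays is an assumption chosen here for robustness to delay variation; the function names and sample values are hypothetical.

```python
from statistics import median


def estimate_delay(prs_clock_seen, tpm_clock_received):
    """Estimate the TPM delivery delay in seconds.

    Both arguments map a game-clock value (seconds remaining) to the
    archive time at which the PRS recognized it in the video, or at which
    the TPM message carrying it arrived.
    """
    common = prs_clock_seen.keys() & tpm_clock_received.keys()
    if not common:
        return None  # not enough data yet to align
    return median(tpm_clock_received[c] - prs_clock_seen[c] for c in common)


def align_tpm(tpm_clock_received, delay):
    """Shift TPM arrival times back onto the archive time base."""
    return {clock: t - delay for clock, t in tpm_clock_received.items()}


prs_clock_seen = {900: 100.0, 899: 101.0, 898: 102.0}
tpm_clock_received = {900: 112.0, 899: 113.5, 898: 114.0}
delay = estimate_delay(prs_clock_seen, tpm_clock_received)
aligned = align_tpm(tpm_clock_received, delay)
```

Re-running `estimate_delay` as more clock ticks accumulate refines the estimate, after which earlier TPM entries can be re-aligned with the updated delay.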

Another desirable feature of the GTM 106 and Metadata 109 archives is the ability to virtually repopulate the archives as an emulation of replaying of the original ingest and processing of the TPM. This emulation feature is useful for system tuning and debugging.

While the disclosure has been described in terms of specific embodiments, it is evident in view of the foregoing description that numerous alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, the disclosure is intended to encompass all such alternatives, modifications and variations which fall within the scope and spirit of the disclosure and the following claims.

COMPILATION OF ACRONYMS

-   AMS Archived Media Stream
-   ASR Automatic Speech Recognition
-   CV Computer Vision
-   GOS Group Of States
-   GTM Ground Truth Metadata
-   HA Human Annotators
-   HAUI Human Annotator User Interface
-   HCI Human Computer Interface
-   LMS Live Media Stream
-   MSA Media Stream Annotator
-   MSN Media Stream Navigator
-   MSPDE Media Stream and PAD Decoder, Display and Editor
-   PAD Proposed Annotation Data
-   POS Performance Optimization System
-   PRS Pattern Recognition System
-   TPM Third Party Metadata

We claim:
 1. A system to annotate media content, comprising: a pattern recognition system (PRS) having an initial set of input parameters that generates PRS output metadata associated with a frame of a media stream; an archive for storing ground truth metadata (GTM) associated with the same frame of the media stream; a device to merge the GTM and the PRS output metadata and thereby generate proposed annotation data (PAD); and a user interface for use by a human annotator (HA) including an editor and an input device to approve or edit the PAD for the frame; and an optimization system to adjust input parameters for the PRS to minimize a distance metric corresponding to a difference between the GTM and PRS output metadata.
 2. The system of claim 1 wherein the GTM is obtained from one or more of third party metadata, archived media stream and the HA.
 3. The system of claim 2 wherein a time delay between third party metadata and the media stream is corrected by alignment.
 4. The system of claim 2 including a communication network to enable a plurality of HA's to interface with the same media stream.
 5. The system of claim 2 wherein when the PAD is approved it is converted to GTM.
 6. The system of claim 5 wherein when the PAD is approved, it is graphically overlaid on the media stream.
 7. The system of claim 1 wherein the optimization system adjusts the PRS initial set of input parameters to minimize the difference between the GTM and PRS output metadata thereby increasing accuracy.
 8. The system of claim 1 wherein the PRS includes a set of state variables stored as a temporal group adjustable as a group in response to GTM.
 9. A method comprising: receiving data from a media stream, the data organized into frames; processing the data using a pattern recognition system (PRS); storing a state of the PRS; generating metadata associated with the frame using the PRS; receiving input characterized as ground truth metadata (GTM), into an optimization system; adjusting input parameters for the PRS to minimize a distance metric corresponding to a difference between the GTM and PRS output metadata.
 10. The method of claim 9 wherein said input is obtained from one or more of archived media streams, third party metadata and one or more human annotators.
 11. The method of claim 10 wherein subsequent to receiving said input, said GTM and said metadata associated with said PRS are temporally aligned.
 12. The method of claim 10 wherein said GTM and said metadata associated with said PRS are continuously stored in memory and periodically stored to disk thereby enabling fast recovery from system failure.
 13. A method comprising receiving from a human annotator (HA), via a human annotator user interface (HAUI), information regarding a time point selected by the HA on a timeline of the media stream; merging existing ground truth metadata (GTM) relating to a media frame corresponding to the selected time point with pattern recognition system (PRS) output metadata relating to said media frame, thereby generating proposed annotation data (PAD) for the media frame; displaying the media frame and the PAD to the HA; receiving input from the HA including correction and/or approval of the PAD, where approved PAD is characterized as new GTM related to the selected time point; storing the new GTM; comparing the PRS output metadata and the new GTM related to the selected time point; and adjusting PRS input parameters so that a distance metric corresponding to a difference between the new GTM and PRS output metadata related to the selected time point is minimized.
 14. The method of claim 13 wherein said GTM is obtained from one or more of archived media streams, third party metadata, said human annotators and other human annotators.
 15. The method of claim 14 wherein when said human annotator approves said PAD, said PAD is graphically overlaid on said media stream.
 16. A method comprising: generating output metadata associated with a frame of a media stream, output by a pattern recognition system (PRS); storing in an archive input from a human annotator (HA) related to the frame, characterized as ground truth metadata (GTM); merging the GTM and the PRS output metadata to thereby generate proposed annotation data (PAD); and displaying the PAD to the HA by a user interface; receiving via the user interface an input from the HA indicating approval of the GTM for the frame; and adjusting input parameters for the PRS using an optimization system, to minimize a distance metric corresponding to a difference between the GTM and the PRS output metadata.
 17. The method of claim 16 wherein said GTM is obtained from one or more of archived media streams, third party metadata, said human annotators and other human annotators.
 18. The method of claim 17 wherein when said human annotator approves said PAD, said PAD is graphically overlaid on said media stream.